Improves hanging flakiness in CI tests#5479
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR introduces temporary debugging instrumentation to investigate intermittent CI timeouts in PhysX articulation tests. It adds verbose print statements to test_external_force_on_single_body_at_position, narrows CI execution to only 3 specific test files, disables timeout retries, and significantly reduces some test timeouts. This is explicitly marked "DO NOT MERGE" — it's a debug-focused investigation branch, not a production change.
Architecture Impact
High-impact CI behavior change: When FOCUS_ARTICULATION_HANG_DEBUG = True (hardcoded), the entire CI test suite is reduced to only 3 test files. This will cause all other tests to be skipped in any CI run using this branch. The conftest.py changes affect the global test collection logic, meaning this would break normal CI if merged.
Implementation Verdict
Needs rework — While appropriate as a temporary debug branch, several issues need attention even for investigation purposes, and the PR absolutely cannot be merged in its current state.
Test Coverage
This is a debug instrumentation PR, not a feature or bug fix, so new test coverage is not applicable. However, the changes reduce test coverage by limiting CI to only 3 files.
CI Status
No CI checks available yet. When CI runs, it will only execute test_articulation.py, test_surface_gripper.py, and test_non_headless_launch.py due to the hardcoded debug focus mode.
Findings
🔴 Critical: tools/conftest.py:51-63 — Debug mode hardcoded to True will break CI for all other tests
FOCUS_ARTICULATION_HANG_DEBUG = TrueThis hardcoded True value combined with the logic at line 687-692 forces filter_pattern to only include 3 test files, skipping the entire rest of the test suite. If this PR were merged (despite the title), CI would effectively stop testing 99%+ of the codebase. The constant should at minimum be controlled by an environment variable for safety:
FOCUS_ARTICULATION_HANG_DEBUG = os.environ.get("FOCUS_ARTICULATION_HANG_DEBUG", "false") == "true"🔴 Critical: tools/test_settings.py:64 — Timeout reduced from 3000s to 300s (10x reduction) is likely too aggressive
"test_surface_gripper.py": 300,This is one of the tests being debugged for timeout issues. Reducing its timeout by 10x while investigating hangs seems counterproductive — it will cause the test to timeout faster and potentially mask the actual hang location. If the test legitimately takes longer, this will cause false failures.
🟡 Warning: source/isaaclab_physx/test/assets/test_articulation.py:1021-1027 — Debug prints inside tight loop will significantly slow test execution
for step_i in range(100):
# ...
print(f"{debug_prefix}: outer loop {outer_i} step {step_i} before write_data_to_sim", flush=True)
# ...
print(f"{debug_prefix}: outer loop {outer_i} step {step_i} before sim.step", flush=True)
# ...
print(f"{debug_prefix}: outer loop {outer_i} step {step_i} after sim.step", flush=True)
# ...
print(f"{debug_prefix}: outer loop {outer_i} step {step_i} after articulation.update", flush=True)With num_articulations in [1, 2] and device in ["cuda:0", "cpu"], this creates 4 test instances × 5 outer loops × 100 inner steps × 4 prints = 8000 print statements with flush=True. This I/O overhead could materially affect timing characteristics and potentially mask or move the hang location. Consider printing only every N steps (e.g., if step_i % 25 == 0).
🟡 Warning: tools/conftest.py:51 — TIMEOUT_RETRIES set to 0 may cause flaky test failures
TIMEOUT_RETRIES = 0Disabling retries is intentional for debugging, but combined with the reduced timeouts, this increases the chance of false-positive failures. The comment and variable should document this is temporary.
🔵 Improvement: source/isaaclab_physx/test/assets/test_articulation.py:969 — Loop variable shadowing
for outer_i in range(5):
# ...
for i in range(num_articulations):
assert articulation.data.root_pos_w.torch[i, 2].item() < 0.2The inner loop uses i which previously was the outer loop variable (changed to outer_i). This is fine, but the inner loop variable i should perhaps be renamed to env_idx or similar for clarity in the debug output context.
🔵 Improvement: tools/conftest.py:368-372 — Debug branch adds -k filter only for articulation test but runs all tests in other focused files
if "test_articulation.py" in normalized_test_file:
cmd.extend(["-k", FOCUS_ARTICULATION_HANG_TEST_EXPR])This means test_surface_gripper.py and test_non_headless_launch.py will run all their tests, while test_articulation.py only runs test_external_force_on_single_body_at_position. If the hang is in one of the other files, all tests in those files will run. Consider whether this is the intended behavior or if similar filtering should apply.
🔵 Improvement: tools/test_settings.py:20 — Timeout reduced from 1500s to 1000s without justification
"test_articulation.py": 1000,The timeout was reduced by 500s. While 1000s is still substantial, the change isn't explained. If this is part of the debug investigation (expecting faster failure), it should be documented.
Greptile SummaryThis PR addresses CI flakiness caused by two distinct root causes: Kit startup being killed before it finishes on slow workers, and
Confidence Score: 5/5Safe to merge — all three changes are narrowly scoped fixes with no regressions on the happy path. The surface-gripper reordering is a one-line-swap with a clear before/after; the startup deadline bump is a constant change; and the CI skip is guarded behind an env-var flag rather than touching production logic. No existing passing tests are broken, and the only coverage removed is a test that was actively causing deadlocks in CI. The Important Files Changed
Sequence DiagramsequenceDiagram
participant CI as CI Runner
participant CF as conftest.py
participant SG as SurfaceGripper._initialize_impl()
participant EXT as isaacsim.robot.surface_gripper
Note over CI,CF: STARTUP_DEADLINE: 45s to 120s
CI->>CF: launch test subprocess
CF->>CF: wait for AppLauncher init complete
alt startup within 120s
CF->>CF: continue test run
else no signal after 120s
CF->>CI: kill + report startup_hang
end
Note over SG,EXT: surface_gripper.py reordering
SG->>SG: check self._device != cpu
alt device is CUDA
SG-->>CI: raise Exception fast fail no extension load
else device is CPU
SG->>EXT: enable_extension
EXT-->>SG: GripperView loaded
SG->>SG: continue initialisation
end
Note over CI,SG: test_surface_gripper.py in CI
CI->>CI: _RUNNING_CI = True
CI->>SG: test_initialization SKIPPED deadlock risk
CI->>SG: test_raise_error_if_not_cpu cuda runs expects Exception
Reviews (2): Last reviewed commit: "add changelog fragment" | Re-trigger Greptile |
| "test_shadow_hand_vision_presets.py": 5000, | ||
| "test_environments_newton.py": 5000, | ||
| "test_surface_gripper.py": 3000, | ||
| "test_surface_gripper.py": 300, |
There was a problem hiding this comment.
Aggressive timeout reduction for
test_surface_gripper.py
test_surface_gripper.py has been cut from 3000 s to 300 s (10×). Because test_surface_gripper.py is also in FOCUS_ARTICULATION_HANG_TEST_PATHS, it will be scheduled and run in the debug CI pass — with TIMEOUT_RETRIES = 0. If the surface-gripper suite legitimately takes more than 300 s (the original 3000 s allocation strongly implies it can), every run will produce a spurious hard-timeout failure that is indistinguishable from the hang under investigation, making diagnosis harder rather than easier.
| TIMEOUT_RETRIES = 0 | ||
| """Number of times to retry a test that reaches its hard timeout before giving up.""" | ||
|
|
||
| FOCUS_ARTICULATION_HANG_DEBUG = True |
There was a problem hiding this comment.
FOCUS_ARTICULATION_HANG_DEBUG has no env-variable escape hatch
FOCUS_ARTICULATION_HANG_DEBUG = True is hardcoded with no way to override it at runtime (e.g., via os.environ.get("FOCUS_ARTICULATION_HANG_DEBUG", "false") == "true"). When this flag is True it unconditionally overrides filter_pattern in pytest_sessionstart, silently ignoring any TEST_FILTER_PATTERN env var or --include-pattern CLI option that a CI job already supplies. Any CI workflow that is not the debug job will inadvertently run only the 3 focus files until this constant is reverted manually.
| TIMEOUT_RETRIES = 0 | ||
| """Number of times to retry a test that reaches its hard timeout before giving up.""" |
There was a problem hiding this comment.
TIMEOUT_RETRIES = 0 affects all tests, not just focus tests
Setting retries to 0 removes the safety net for every test in the session, not only the three focus files. Tests that are intermittently slow (e.g., shader-cache cold starts, resource-contended CI runners) and previously relied on the two retries will now fail permanently. Consider scoping the no-retry behavior to FOCUS_ARTICULATION_HANG_TEST_PATHS entries only, so other tests retain their retry budget.
| body_ids=body_ids, | ||
| is_global=is_global, | ||
| ) | ||
| print(f"{debug_prefix}: outer loop {outer_i} after set/add wrench", flush=True) | ||
| # perform simulation | ||
| for _ in range(100): | ||
| for step_i in range(100): | ||
| # apply action to the articulation |
There was a problem hiding this comment.
Per-step prints produce ~1 000+ lines per test parameterization
Two print calls are placed inside the 100-step inner loop (before write_data_to_sim and after articulation.update). With -s capturing disabled and 5 outer iterations, that is at minimum 1 000 lines per (device, num_articulations) combination. Across all parameterized combinations this becomes very verbose. Moving these traces outside the inner loop, or gating them with a step_i % 10 == 0 condition, would preserve diagnostic value without flooding logs.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This is a follow-up review. The new commits add debug instrumentation to test_non_headless_launch.py and the Newton articulation test file, and switch from print() to sys.__stdout__.write() for bypass-buffering debug output. The core debug scaffolding remains unchanged.
Previous Concerns Status
- TIMEOUT_RETRIES = 0: Still set to 0, but this is intentional for debugging — acknowledged.
- Hardcoded FOCUS_ARTICULATION_HANG_DEBUG = True: Still hardcoded, but given "[DO NOT MERGE]" title, this is acceptable for the debug branch.
- Aggressive timeout reductions: Still present (300s for surface_gripper, 1000s for articulation).
- Debug prints in tight loop: Now using
sys.__stdout__.write()with explicitflush(), which is marginally better thanprint(flush=True)for bypassing Python buffering, but the I/O overhead concern in the 100-step inner loop remains.
New Findings
🔵 Improvement: Consistent debug instrumentation pattern across files
The new commits add sys.__stdout__.write() debug markers to test_non_headless_launch.py and test_surface_gripper.py, which is consistent with the debug investigation approach. The Newton articulation test now mirrors the PhysX version's detailed step-by-step logging. This is appropriate for the debugging purpose.
No new critical or warning-level issues introduced. The PR remains a debug-only branch as indicated by the title.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This is a follow-up review. The new commit (7296571) makes no substantive changes — the diff between the previous review and this commit shows identical content. The debug instrumentation, FOCUS_ARTICULATION_HANG_DEBUG = True, TIMEOUT_RETRIES = 0, and reduced timeouts all remain as previously noted.
Status
All previous observations remain valid and unchanged. No new issues introduced. The PR continues to be appropriate for its stated "[DO NOT MERGE]" debug investigation purpose.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
No substantive changes between the previous review commit and c9a364b — the debug instrumentation remains identical. The PR continues to serve its stated "[DO NOT MERGE]" debug investigation purpose with FOCUS_ARTICULATION_HANG_DEBUG = True, disabled timeout retries, and verbose per-step logging to diagnose the intermittent articulation hang.
CI Status Note
The widespread CI failures (isaaclab_physx, isaaclab_newton, etc.) are expected since FOCUS_ARTICULATION_HANG_DEBUG = True restricts test execution to only the articulation-related debug files, causing normal test discovery to be skipped.
Verdict: Appropriate for debug investigation branch. Do not merge to develop.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
The new commit (0f09c3b) adds one minor change: moving the enable_extension() and from isaacsim.robot.surface_gripper import GripperView import after the CPU device check in surface_gripper.py. This is a reasonable improvement — it avoids loading the extension unnecessarily when the device check will fail anyway.
🔵 Improvement: source/isaaclab_physx/isaaclab_physx/assets/surface_gripper/surface_gripper.py:446-451 — The reordering is sensible, but the exception type should be RuntimeError (not bare Exception) to follow Isaac Lab conventions for configuration/initialization errors.
Verdict: No new concerns. Debug investigation branch remains appropriate for its stated "[DO NOT MERGE]" purpose.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
The new commit (afb26fb) adds minor changes: a sys import move in test_non_headless_launch.py and debug logging statements. The previous concern about using bare Exception instead of RuntimeError in surface_gripper.py (lines 448-451) has NOT been addressed — the code still raises Exception rather than RuntimeError.
No new issues introduced. This debug branch remains appropriate for its "[DO NOT MERGE]" investigation purpose.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
The new commit (1488ff4) adds more debug logging to test_surface_gripper.py and includes a @pytest.mark.skipif to skip test_initialization in CI environments due to a known deadlock issue.
Previous concern NOT addressed: The surface_gripper.py still raises bare Exception (lines 448-451) instead of RuntimeError.
New observation: The skipif logic on line 175 is reasonable for CI stability during debugging, but note that os.environ.get("GITLAB_CI") returns a string (or None), not a boolean — it should be compared to a truthy value like the others. However, since any non-empty string is truthy in Python's boolean context, this works correctly in practice.
No blocking new issues. This remains an appropriate debug branch for its stated investigation purpose.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
The new commit (4a15941) increases STARTUP_DEADLINE from 45s to 120s in tools/conftest.py and updates the docstring accordingly.
Previous concern NOT addressed: The surface_gripper.py still raises bare Exception (lines 448-451) instead of RuntimeError. This remains a code quality issue.
New changes look reasonable: The increased startup deadline (120s) is a sensible adjustment for cold CI workers where Kit startup can legitimately exceed 60s. The docstring update accurately reflects the new behavior.
No new blocking issues introduced. This debug branch continues to be appropriate for its investigation purpose.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
The new commit (f741802) reorders code in surface_gripper.py to check self._device != "cpu" before calling enable_extension() and importing GripperView. This is a good fix — it ensures the CPU-only check fails fast without triggering the potentially deadlocking extension initialization.
Previous concern still NOT addressed: The code still raises bare Exception (line 448) instead of RuntimeError. Minor code quality issue.
New skipif decorator looks correct: The test now skips on CI environments to avoid the deadlock, while keeping the CUDA fail-fast coverage for catching device misuse. This is a reasonable workaround for debugging.
No new blocking issues. Ship it for the debug/investigation purpose of this branch.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Follow-up Review
No new changes since the previous review (b46f04f is the same as f741802 referenced earlier). The reordering fix in surface_gripper.py correctly moves the CPU check before enable_extension() to fail fast.
Previous minor concern still open: The code raises bare Exception instead of RuntimeError at line 448. This is non-blocking.
CI Status: All substantive jobs are still pending. The fix is logically sound — ship it once CI passes.
Description
SurfaceGripperfail fast on non-CPU simulation backends before loading the surface gripper extension.SurfaceGripperViewCPU initialization path that can deadlock, while keeping CUDA fail-fast coverage.Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there